Abstract —Obstacle detection plays an important role in unmanned surface vehicles (USVs). USVs operate in highly diverse
environments in which an obstacle may be a floating piece of wood, a scuba diver, a pier, or a part of a shoreline, which presents a
significant challenge to continuous detection from images taken onboard. This paper addresses the problem of online detection by
constrained unsupervised segmentation. To this end, a new graphical model is proposed that affords a fast and continuous obstacle
image-map estimation from a single video stream captured onboard a USV. The model accounts for the semantic structure of the marine
environment as observed from a USV by imposing weak structural constraints. A Markov random field framework is adopted and a highly
efficient algorithm for simultaneous optimization of model parameters and segmentation mask estimation is derived. Our approach
does not require computationally intensive extraction of texture features and comfortably runs in real time. The algorithm is tested on a
new, challenging dataset for segmentation and obstacle detection in marine environments, which is the largest annotated dataset of its
kind. Results on this dataset show that our model outperforms the related approaches, while requiring a fraction of the computational
effort.

Index Terms —Obstacle map estimation, Autonomous surface vehicles, Markov random fields, Gaussian mixture models. 
1 Introduction

Obstacle detection is of central importance for lower-end small unmanned surface
vehicles (USVs) used for patrolling coastal waters (see Figure 1). Such vehicles are
typically used in perimeter surveillance, in which the USV travels along a pre-planned
path. To respond quickly and efficiently to the challenges of a highly dynamic
environment, the USV requires onboard logic to observe its surroundings, detect
potentially dangerous situations, and apply proper route modifications. An important
feature of such a vessel is the ability to detect an obstacle at a sufficient distance
and react by replanning its path to avoid collision. The primary type of obstacle in
this case is the shoreline itself, which can be avoided to some extent (although not
fully) by the use of detailed maps and satellite navigation. Indeed, Heidarsson and
Sukhatme [1] proposed an approach that utilizes an overhead image of the area obtained
from Google Maps to construct a map of static obstacles. But such an approach cannot
handle a more difficult class of dynamic obstacles that do not appear in the map (e.g.,
boats, buoys and swimmers).

A small USV requires the ability to detect nearby and distant obstacles. The detection
should not be constrained to objects that stand out from the water, but should also
cover flat objects, such as debris or emerging scuba divers. Operation in shallow
waters and marinas constrains the size of the USV and prevents the use of additional
stabilizers. This puts further constraints on the weight, power consumption, types of
sensors and their placement. Cameras are therefore becoming attractive sensors for use
in low-end USVs due to their cost, weight and power efficiency and their large
field-of-view coverage. This presents a challenge for the development of highly
efficient computer vision algorithms tailored for obstacle detection in the challenging
environments that small USVs face. In this paper we address this challenge by proposing
a segmentation-based algorithm for obstacle-map estimation that is derived from
optimizing a new well-defined graphical model and runs at over 70 fps in Matlab on a
single-core machine.

Fig. 1. Our approach to obstacle image-map estimation.


1.1 Related work 

The problem of obstacle detection has been explicitly or implicitly addressed previously
in the field of unmanned ground vehicles (UGVs). In a trail-following application,
Rasmussen et al. [2] use an omnidirectional camera to detect the trail as the region
that is most contrasted to its surroundings; however, dynamic obstacles are not
addressed. Several works, e.g., Montemerlo et al. [3] and Dahlkamp et al. [4], address
the problem of low-proximity road detection with laser scanners by bootstrapping color
segmentation with the laser output. The proximal road points are detected by the laser,
projected to the camera and used to learn a Gaussian mixture model, which is in turn
used to segment the rest of the image captured by the camera. Combined with the horizon
detection of Ettinger et al. [5], this approach significantly increases the distance at
which obstacles on the road can be detected. Alternatively, Lu and Rasmussen [6] cast
obstacle detection as a labelling task in which they apply a bank of pre-trained
classifiers to 3D point clouds and a Markov random field to account for the spatial
smoothness of the labelling.

Most UGV approaches for obstacle detection explicitly or implicitly rely on ground plane
estimation from range sensors and are not directly applicable to the aquatic
environments encountered by USVs. Rankin et al. [7] propose a specific body-of-water
detector in wide open areas from a UGV using a monocular color camera. Their detector
assumes that, in an undisturbed water surface, a change in saturation-to-brightness
ratio across a water body from the leading to the trailing edge is uniform and distinct
from other terrain types. They apply several ad-hoc processing steps to gradually grow
the water regions from the initial candidates and apply a sequence of pre-set thresholds
to remove spurious false detections of water pools. However, their method is based on
the undisturbed water surface assumption, which is violated in coastal and open water
applications. Scherer et al. [8] propose a water detection algorithm using a stereo
Bumblebee camera, IMU/GPS and a rotating laser scanner for navigation on a river. Their
system extracts color and texture features over blocks of pixels and eliminates the sky
region using a pre-trained classifier. A horizon line, obtained from the onboard IMU, is
then projected into the image to obtain samples for learning a color distribution of the
regions below and above the horizon, respectively. Using these distributions, the image
is segmented and the results of the segmentation are used in turn, after additional
postprocessing steps, to train a classifier. The trained classifier is fused with a
classifier from the previous frames and applied to the blocks of pixels to detect the
water region. This system relies heavily on the quality of the hardware-based horizon
estimation, the accuracy of the pre-trained sky detector and the postprocessing steps.
The authors report that the vision-based segmentation is not processed onboard, but
requires special computing hardware, which makes it slower than real time at the
constrained processing power typical for small USVs.

Some of the standard range sensor modalities for autonomous navigation in maritime
environments include radar [9], sonar [10] and ladar [11]. Range scanners are known to
poorly discriminate between water and land in the far field [11], suffer from angular
resolution and scanning rate limitations, and perform poorly when the beam's incidence
angle is not oblique with respect to the water surface [12], [13]. Several researchers
have thus resorted to cameras, e.g., [8], [13], [14], [15], [16], [17], for obstacle and
moving object detection instead. To detect dynamic objects in a harbor, Socek et
al. [14] assume a static camera and apply background subtraction combined with motion
cues. However, background subtraction cannot be applied to the highly dynamic scenes
encountered on a moving USV. Huntsberger et al. [17] attempt to address this issue using
stereo systems, but require large-baseline rigs that are less appropriate for small
vessels due to increased instability and that limit processing of near-field regions.
Santana et al. [13] apply fusion of Lucas-Kanade local trackers with color
oversegmentation and a sequence of k-means clusterings on texture features to detect
water regions in videos. Alternatively, Fefilatyev and Goldgof [15] and Wang et al. [16]
apply a low-power solution using a monocular camera for obstacle detection. They first
detect the horizon line and then search for a potential obstacle in the region below the
horizon. A fundamental drawback of these approaches is that they approximate the edge of
water by a horizon line and cannot handle situations in coastal waters, close to the
shoreline or in a marina. There, the edge of water no longer corresponds to the horizon
and can no longer be modeled as a straight line. Such cases call for more general
segmentation approaches.

Many unsupervised segmentation approaches have been proposed in the literature. Khan and
Shah [18] use optical flow, color and spatial coordinates to construct features which
are used in single Gaussians to segment a moving object in video. Nguyen and Wu [19]
propose Student-t mixture models for robustifying segmentation. Improved segmentation
can be achieved by applying a Bayesian regularization scheme in Gaussian mixture models;
however, care has to be taken at initialization [20]. Felzenszwalb and Huttenlocher [21]
have proposed a graph-theoretic clustering to perform segmentation of color images into
visually coherent regions. The assumption that neighboring pixels likely belong to the
same class is formally addressed in the context of Markov random fields (MRF) [22],
[23]. By constraining the solutions of the segmentation to mimic the high-level
semantics of urban scenes, Felzenszwalb and Veksler [24] proposed a three-strip
segmentation algorithm that can be implemented by a dynamic program. Wojek and
Schiele [25] have extended conditional random fields with dynamic models and perform the
inference for object detection and labeling jointly in videos. The random field
frameworks [26] have proven quite successful for addressing semantic labeling tasks, and
recently Kontschieder et al. [27] have shown that structural priors between classes
further improve the labeling. Alternative schemes that avoid applying an MRF to enforce
spatial consistency have been proposed, e.g., Chen et al. [28] and Nguyen et al. [29].
Approaches like Wojek et al. [25] use high-dimensional features composed of color and
texture at multiple scales and object-class specific detectors to segment the images and
detect the objects of interest. In our scenarios, the possible types of dynamic
obstacles are unknown and vary significantly in appearance. Thus object-class specific
detectors are not suitable. Several bottom-up graph-theoretic approaches have been
proposed for unsupervised segmentation, e.g., [30], [31], [32], [33]. Recently, Alpert
et al. [32] have proposed an approach that starts from the pixel level and gradually
constructs visually homogeneous regions by agglomerative clustering. They achieved
impressive results on a segmentation dataset in which an object occupied a significant
portion of the image. Unfortunately, since their algorithm incrementally merges regions,
it is too slow for online application even at moderate image sizes. An alternative to
starting the segmentation from the pixel level is to start from an oversegmented image
in which pixels are grouped into superpixels [34]. Lu et al. [35] apply spectral
clustering to an affinity graph induced over a superpixelated image. Li et al. [33] have
proposed a segmentation algorithm that uses multiple superpixel oversegmentations and
merges their results by bipartite graph partitioning to achieve state-of-the-art results
on a standard segmentation dataset. However, no prior information is provided to favor
certain types of segmentations in specific scenes.

1.2 Our approach 

We pursue a solution for obstacle detection that is based on concepts of image
segmentation with weak semantic priors on the expected scene composition. Figure 2 shows
typical images captured from a USV. While the images vary significantly in appearance,
we observe that each image can be split into three semantic regions roughly stacked one
above the other, implying a structural relation between the regions. The bottom region
represents the water, while the top region represents the sky. The middle component can
represent land, parked boats, a haze above the horizon, or a mixture of these.

Fig. 2. Images captured from the USV split into three semantically different regions.

Our main contribution is a graphical model for structurally-constrained semantic
segmentation with application to USV obstacle-map estimation. The generative model
assumes a mixture model with three Gaussian components for the three dominant image
regions and a uniform component for explaining the outliers, which may constitute an
obstacle in the water. Weak priors are assumed on the mixture parameters and an MRF is
placed over the prior as well as the posterior pixel-class distributions to favor smooth
segmentations. We derive an EM algorithm for the proposed model and show that the
resulting optimization achieves fast convergence at a low computational cost, without
resorting to specialized hardware. A similar graphical model was proposed by Diplaros et
al. [36], but their model requires a manually set variable, does not apply priors and is
not derived from a single density function. Our model is applied to obstacle image-map
estimation in USVs. The proposed model acts directly on the color image and does not
require expensive extraction of texture-based features. Combined with efficient
optimization, this results in real-time segmentation and obstacle-map estimation
(several-fold faster than the camera frame rate). Our approach is outlined in Figure 1.
The semantic model is fitted to the input image, after which each pixel is classified
into one of the four classes. All pixels that do not correspond to the water component
are deemed to be part of an obstacle. Figure 1 shows the detection of a dynamic obstacle
(buoy) and of a static obstacle (shoreline). Our second contribution is a marine dataset
for semantic segmentation and obstacle detection, together with a performance evaluation
methodology. To our knowledge this will be the largest annotated publicly available
marine dataset of its kind to date.

A preliminary version of our algorithm was presented in Kristan et al. [37] and is
extended in this paper on several levels. Additional discussion and related work are
provided. An improved initialization of the segmentation model by soft resets of the
parameters is proposed, and additional details of the algorithm and the dataset are
provided. In particular, the dataset capturing procedure and annotation are discussed
and additional statistics of the obstacles in the dataset are provided. The experiments
are extended by a performance analysis with respect to the color space, the obstacle
size and the time-of-day driving conditions. The learning of the priors used in our
model is discussed in detail and the dataset is extended with training images used for
estimating the priors.

Our approach is most closely related to the works in urban-scene parsing by Felzenszwalb
and Veksler [24] and maritime scene understanding by Fefilatyev and Goldgof [15], Wang
et al. [16] and Scherer et al. [8]. There are notable differences between these
approaches and ours. The first difference to [24] is that they only address the labeling
part of the segmentation problem and require precomputed per-pixel label confidences.
The second difference is that their approach produces segmentations with a homogeneous
bottom region, which prevents detection of obstacles without further postprocessing. In
contrast, our approach jointly learns the component appearance, estimates the per-pixel
class probabilities, and optimizes the segmentation within a single online framework.
Furthermore, learning the parameters of [24] is not as straightforward. Compared to the
related water segmentation algorithms for maritime applications (i.e., [8], [15], [16]),
our approach completely avoids the need for a good horizon estimation. Nevertheless, the
proposed probabilistic model is general enough to directly incorporate this information
if available.

The remainder of the paper is structured as follows. In Section 2 we derive our semantic
generative model, in Section 3 we present the obstacle detection algorithm, in Section 4
we detail the implementation and learning of the priors, in Section 5 we present the new
dataset and the accompanying evaluation protocol, in Section 6 we experimentally analyze
the algorithm, and we draw conclusions in Section 7.

2 The semantic generative model 

Fig. 3. The graphical model for semantic segmentation.

We consider the image as an array of measured values $Y = \{y_i\}_{i=1:M}$, in which
$y_i \in \mathbb{R}^d$ is a $d$-dimensional measurement, a feature vector, at the $i$-th
pixel in an image with $M$ pixels. As we detail in the subsequent sections, the feature
vector is composed of the pixel's color and image coordinates. The probability of the
$i$-th pixel feature vector is modelled as a mixture model with four components, three
Gaussians and a single uniform component:

$$p(y_i|\Theta) = \sum_{k=1}^{3} \phi(y_i|\mu_k, \Sigma_k)\,\pi_{ik} + \mathcal{U}(y_i)\,\pi_{i4}, \tag{1}$$

where $\Theta = \{\mu_k, \Sigma_k\}_{k=1:3}$ are the means and covariances of the
Gaussian kernels $\phi(\cdot|\mu, \Sigma)$ and $\mathcal{U}(\cdot)$ is a uniform
distribution. The $i$-th pixel label $x_i$ is an unobserved random variable governed by
the class prior distribution $\pi_i = [\pi_{i1}, \ldots, \pi_{il}, \ldots, \pi_{i4}]$
with $\pi_{il} = p(x_i = l)$. The three Gaussian components represent the three dominant
semantic regions in the image, while the uniform component represents the outliers,
i.e., pixels that do not likely correspond to any of the three structures. To encourage
segmentations into three approximately vertically aligned semantic structures, we define
a set of priors $\varphi_0 = \{\mu_{\mu_k}, \Sigma_{\mu_k}\}_{k=1:3}$ for the mean
values of the Gaussians, i.e.,
$p(\Theta|\varphi_0) = \prod_{k=1}^{3} \phi(\mu_k|\mu_{\mu_k}, \Sigma_{\mu_k})$.
To encourage smooth segmentations, the priors $\pi_i$, as well as the posteriors over
the pixel class labels, are treated as random variables, which form a Markov random
field. Imposing the MRF on the priors and posteriors rather than on the pixel labels
allows effectively integrating out the labels, which leads to a well-behaved class of
MRFs [36] that avoid image reconstruction during parameter learning. The resulting
graphical model with priors is shown in Figure 3.
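
To make the mixture in (1) concrete, the following minimal sketch evaluates the likelihood of a single pixel. It is an illustration only: the function names are hypothetical, and the Gaussians are restricted to diagonal covariances for brevity, whereas the model above uses full covariances $\Sigma_k$.

```python
import math


def gaussian_pdf_diag(y, mean, var):
    """Gaussian density with a diagonal covariance (a simplifying assumption)."""
    log_p = 0.0
    for y_d, m_d, v_d in zip(y, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * v_d) - 0.5 * (y_d - m_d) ** 2 / v_d
    return math.exp(log_p)


def pixel_likelihood(y, means, variances, prior, uniform_value):
    """Evaluate Eq. (1): three weighted Gaussian terms plus a weighted uniform term.

    prior holds the four class priors of this pixel; prior[3] belongs to the
    uniform (outlier) component.
    """
    gaussian_part = sum(
        gaussian_pdf_diag(y, means[k], variances[k]) * prior[k] for k in range(3)
    )
    return gaussian_part + uniform_value * prior[3]
```

With all prior mass on the uniform component the function returns `uniform_value`, which is how a pixel that fits none of the three regions still receives a non-zero likelihood.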
Let $\pi = \{\pi_i\}_{i=1:M}$ denote the set of priors for all pixels.
Following [22], we approximate the joint distribution over the priors as
$p(\pi) \approx \prod_i p(\pi_i|\pi_{N_i})$, where $\pi_{N_i}$ is a mixture
distribution over the priors of the $i$-th pixel's neighbors, i.e.,
$\pi_{N_i} = \sum_{j \in N_i, j \neq i} \lambda_{ij}\pi_j$, where $\lambda_{ij}$ are
fixed positive weights such that for each $i$-th pixel $\sum_j \lambda_{ij} = 1$. The
potentials in the MRF are defined as

$$p(\pi_i|\pi_{N_i}) \propto \exp\left(-\tfrac{1}{2}E(\pi_i, \pi_{N_i})\right), \tag{2}$$

with the exponent defined as

$$E(\pi_i, \pi_{N_i}) = D(\pi_i \| \pi_{N_i}) + H(\pi_i). \tag{3}$$

The term $D(\pi_i \| \pi_{N_i})$ is the Kullback-Leibler divergence, which penalizes the
differences between prior distributions over the neighboring pixels ($\pi_i$ and
$\pi_{N_i}$), while the term $H(\pi_i)$ is the entropy defined as

$$H(\pi_i) = -\sum_{k=1}^{4} \pi_{ik} \log \pi_{ik}, \tag{4}$$

which penalizes uninformative priors $\pi_i$. The joint distribution for the graphical
model in Figure 3 can be written as

$$p(Y, \Theta, \pi|\varphi_0) = \prod_{i=1}^{M} p(y_i|\Theta, \varphi_0)\,p(\Theta|\varphi_0)\,p(\pi_i|\pi_{N_i}). \tag{5}$$
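
As a numerical illustration of the potentials (2)-(4), the sketch below computes the neighborhood mixture and the energy for one pixel. The helper names are hypothetical, and the code assumes that the neighborhood mixture is non-zero wherever the pixel prior is non-zero, so that the divergence is finite.

```python
import math


def neighborhood_mixture(neighbor_priors, weights):
    """Weighted mixture of the neighbors' priors; weights are assumed to sum to one."""
    n_classes = len(neighbor_priors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, neighbor_priors))
        for c in range(n_classes)
    ]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); zero-probability terms of p are skipped."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0.0)


def entropy(p):
    """Entropy of Eq. (4)."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0.0)


def mrf_energy(prior, neighborhood_prior):
    """Energy of Eq. (3): divergence from the neighborhood plus entropy."""
    return kl_divergence(prior, neighborhood_prior) + entropy(prior)
```

A pixel whose prior agrees with its neighborhood pays no divergence cost, so its energy is the entropy alone: log 4 for a uniform prior over four classes, and less for a confident prior.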

Diplaros et al. [36] argue that improved segmentations can be achieved by also
considering an MRF directly on the pixel posterior distributions by treating the
posteriors as random variables $P = \{p_i\}_{i=1:M}$, where the components of $p_i$ are
defined as $p_{ik} = p(x_i = k|\Theta, y_i, \varphi_0)$, computed by Bayes' rule from
$p(y_i|x_i = k, \Theta)$ and $p(x_i = k)$. We can write the posterior over $P$ as
$p(P|Y, \Theta, \pi, \varphi_0) \propto \prod_{i=1}^{M} \exp\left(-\tfrac{1}{2}E(p_i, p_{N_i})\right)$,
where $p_{N_i}$ is a mixture defined in the same spirit as $\pi_{N_i}$. The joint
distribution can now be written as

$$p(P, Y, \Theta, \pi|\varphi_0) \propto \exp\Big[\sum_{i=1}^{M} \Big(\log p(y_i, \Theta|\varphi_0) - \tfrac{1}{2}\big(E(\pi_i, \pi_{N_i}) + E(p_i, p_{N_i})\big)\Big)\Big]. \tag{6}$$

Due to the coupling between $\pi_i$/$\pi_{N_i}$ and $p_i$/$p_{N_i}$, the optimization
of (6) is not straightforward. We therefore introduce auxiliary variables $q_i$ and
$s_i$ and take the logarithm, which results in the following cost function

$$F = \sum_{i=1}^{M} \Big[\log p(y_i, \Theta|\varphi_0) - \tfrac{1}{2}\big(D(s_i \| \pi_i \circ \pi_{N_i}) + D(q_i \| p_i \circ p_{N_i})\big)\Big], \tag{7}$$

where $\circ$ is the Hadamard (component-wise) product. Note that when $q_i = p_i$ and
$s_i = \pi_i$, (7) reduces to (6) (ignoring the constant terms). Maximization of $F$ can
now be achieved in an EM-like fashion. In the E-step we maximize $F$ w.r.t. $q_i$ and
$s_i$, while the M-step maximizes over the parameters $\Theta$ and $\pi$. We can see
from (7) that $F$ is maximized w.r.t. $q_i$ and $s_i$ when the divergence terms vanish;
therefore, $s_i^{\mathrm{opt}} = \xi_{s_i}\,\pi_i \circ \pi_{N_i}$ and
$q_i^{\mathrm{opt}} = \xi_{q_i}\,p_i \circ p_{N_i}$, where $\xi_{s_i}$ and $\xi_{q_i}$
are the normalization constants.
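
For a single pixel, the E-step update is a normalized component-wise product of the pixel's distribution with its neighborhood mixture. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def e_step_update(own, neighborhood):
    """Normalized Hadamard product of a distribution with its neighborhood mixture.

    The division by the sum plays the role of the normalization constant.
    """
    product = [a * b for a, b in zip(own, neighborhood)]
    total = sum(product)
    return [value / total for value in product]


# Hypothetical values: the pixel weakly prefers class 1, its neighbors strongly do.
pi_i = [0.4, 0.3, 0.2, 0.1]
pi_neighborhood = [0.7, 0.1, 0.1, 0.1]
s_opt = e_step_update(pi_i, pi_neighborhood)
```

In this example the auxiliary distribution becomes more peaked on the first class than the pixel's own prior, which illustrates how agreement with the neighborhood sharpens the estimate.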


The M-step is not as straightforward, since direct optimization over $\Theta$ and $\pi$
is intractable, and we resort to maximizing its lower bound. We define
$\hat{s}_i = (s_i + s_{N_i})$ and $\hat{q}_i = (q_i + q_{N_i})$ and by Jensen's
inequality lower-bound the divergence terms as

$$-D(s_i \| \pi_i \circ \pi_{N_i}) \geq \hat{s}_i^{T} \log \pi_i,$$
$$-D(q_i \| p_i \circ p_{N_i}) \geq \hat{q}_i^{T} \log p_i, \tag{8}$$

where we have ignored the terms independent of $\pi_i$ and $p_i$. Substituting (8)
into (7) and collecting the relevant terms yields the following lower bound on the cost
function (7):

$$\hat{F} = \sum_{i=1}^{M} \Big[\tfrac{1}{2}(q_i + q_{N_i})^{T} \log\big(p_i\,p(\Theta|\varphi_0)\big) + \tfrac{1}{2}(\hat{s}_i + \hat{q}_i)^{T} \log \pi_i\Big]. \tag{9}$$

Differentiating (9) w.r.t. $\pi_i$ and applying a Lagrange multiplier with the
constraint $\sum_k \pi_{ik} = 1$, we see that $\hat{F}$ is maximized at
$\pi_i^{\mathrm{opt}} = \tfrac{1}{4}(\hat{s}_i + \hat{q}_i)$. Differentiating (9) w.r.t.
the means and covariances of the Gaussians, we obtain

M 

fT = Pk 1 l A kCFqikyT)xp - 

i= 1 


( 10 ) 
$$\Sigma_k^{\mathrm{opt}} = \beta_k^{-1} \sum_{i=1}^{M} \hat{q}_{ik}(y_i - \mu_k)(y_i - \mu_k)^{T}, \tag{11}$$

where we have defined $\beta_k = \sum_{i=1}^{M} \hat{q}_{ik}$ and
$\Lambda_k = (\Sigma_k^{-1} + \Sigma_{\mu_k}^{-1})^{-1}$. An appealing property of the
proposed model is that its E-step can be efficiently implemented through convolutions
and Hadamard products. Recall that the calculation of the $i$-th pixel's neighborhood
prior distribution $\pi_{N_i}$ entails a weighted combination of the neighboring pixel
priors $\pi_j$. Let $\pi_{\cdot k}$ be the $k$-th component priors arranged in a matrix
of image size. Then the neighborhood priors can be computed by the convolution
$\pi_{N\cdot k} = \pi_{\cdot k} * \lambda$, where $\lambda$ is a discrete kernel with its
central element set to zero and its elements summing to one. Let $\hat{s}_{\cdot k}$,
$\hat{q}_{\cdot k}$ and $p_{\cdot k}$ be the image-sized counterparts corresponding to
the sets of distributions $\{\hat{s}_i\}_{i=1:M}$, $\{\hat{q}_i\}_{i=1:M}$ and
$\{p_i\}_{i=1:M}$, respectively, and let $\lambda_1$ denote the kernel $\lambda$ in
which the central element is set to one. Then the calculation of the $k$-th component
priors $\pi_{\cdot k}^{\mathrm{opt}}$ for all pixels in the E-step can be written as

$$\hat{s}_{\cdot k} = (\xi_{s\cdot} \circ \pi_{\cdot k} \circ (\pi_{\cdot k} * \lambda)) * \lambda_1,$$
$$\hat{q}_{\cdot k} = (\xi_{q\cdot} \circ p_{\cdot k} \circ (p_{\cdot k} * \lambda)) * \lambda_1,$$
$$\pi_{\cdot k}^{\mathrm{opt}} = (\hat{s}_{\cdot k} + \hat{q}_{\cdot k})/4. \tag{12}$$

The EM procedure for fitting our generative model to the input image is summarized in
Algorithm 1.

Algorithm 1: The EM for semantic segmentation.

Require:

Pixel features $Y = \{y_i\}_{i=1:M}$, priors $\varphi_0$, initial values for $\Theta$
and $\pi$.

Ensure:

The estimated parameters $\pi^{\mathrm{opt}}$, $\Theta^{\mathrm{opt}}$ and the smoothed
posterior $\{\hat{q}_{\cdot k}\}_{k=1:4}$.

Procedure:

1: Calculate the pixel posteriors $p_{\cdot k}$ using the current estimates of $\pi$ and
$\Theta$ for all $k$ (1).

2: Calculate the new pixel priors $\pi_{\cdot k}^{\mathrm{opt}}$ and posteriors
$\hat{q}_{\cdot k}$ for all $k$ using (12).

3: Calculate the new parameter values $\Theta$ using (10) and (11).

4: Iterate steps 1 to 3 until convergence.
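
Step 2 of Algorithm 1 can be sketched with plain nested lists as follows. This is an illustration under stated assumptions rather than the reference Matlab implementation: the function names are hypothetical, the convolution uses zero padding at the image border (the border handling is not specified above), and all input values are assumed to be strictly positive so that the per-pixel normalization is well defined.

```python
def convolve2d(image, kernel):
    """Same-size 2-D convolution with zero padding (border handling is an assumption)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + kh // 2 - i, c + kw // 2 - j
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def smooth(maps, kernel):
    """Apply one of the first two lines of Eq. (12) to all components at once.

    maps[k][r][c] holds component k of the distribution at pixel (r, c).
    """
    n_comp, h, w = len(maps), len(maps[0]), len(maps[0][0])
    # Hadamard product of each map with its neighborhood mixture.
    neigh = [convolve2d(m, kernel) for m in maps]
    prod = [[[maps[k][r][c] * neigh[k][r][c] for c in range(w)]
             for r in range(h)] for k in range(n_comp)]
    # Per-pixel normalization across components (the xi factors).
    for r in range(h):
        for c in range(w):
            total = sum(prod[k][r][c] for k in range(n_comp))
            for k in range(n_comp):
                prod[k][r][c] /= total
    # Kernel with its central element set to one: adds each pixel to its neighbors.
    kernel_one = [row[:] for row in kernel]
    kernel_one[len(kernel) // 2][len(kernel[0]) // 2] = 1.0
    return [convolve2d(p, kernel_one) for p in prod]


def update_priors(priors, posteriors, kernel):
    """Return the new priors and the smoothed posteriors, following Eq. (12)."""
    s_hat = smooth(priors, kernel)
    q_hat = smooth(posteriors, kernel)
    pi_opt = [[[(s + q) / 4.0 for s, q in zip(s_row, q_row)]
               for s_row, q_row in zip(s_map, q_map)]
              for s_map, q_map in zip(s_hat, q_hat)]
    return pi_opt, q_hat
```

Away from the border, each smoothed distribution sums to two across components, which is why dividing their sum by four yields priors that sum to one. With zero padding this holds only for interior pixels; a different border treatment would be needed to preserve it everywhere.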


3 Obstacle detection 

We formulate obstacle detection as a problem of estimating an image obstacle map, i.e.,
determining the pixels in the image that correspond to the sea, while all the remaining
pixels represent potential obstacles. We therefore first fit our semantic model from
Section 2 to the input image and estimate the smoothed a posteriori probability
distribution $\hat{q}_{ik}$ across the four semantic components for each pixel. The
$i$-th pixel is classified as water if the corresponding posterior $\hat{q}_{ik}$
reaches its maximum for the water component among all four components. In our setting
the component indexed by $k = 1$ corresponds to the water region, which results in the
labeled image $B$ with the $i$-th pixel label $b_i$ defined as

$$b_i = \begin{cases} 1, & \arg\max_k \hat{q}_{ik} = 1 \\ 0, & \text{otherwise.} \end{cases} \tag{13}$$

Retaining only the largest connected region in the image $B$ results in the current
obstacle image map $B_t$. All blobs of non-water pixels within the connected water
region are proclaimed potential obstacles in the water. This is followed by a nonmaxima
suppression stage, which merges detections that are located in close proximity (e.g.,
due to object fragmentation) to reduce multiple detections of the same obstacle. The
water edge is extracted as the longest connected outer edge of the connected region
corresponding to the water. The obstacle detection is summarized in Algorithm 2 and
visualized in Figure 1.
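
The labeling rule and the retention of the largest connected water region can be sketched as follows. The function names are hypothetical, the water component is stored at list index 0 (corresponding to $k = 1$ in the text), and 4-connectivity is an assumption, since the connectivity used is not stated above.

```python
from collections import deque


def water_mask(q_hat):
    """Label a pixel 1 if the water component (index 0) has the largest posterior."""
    return [[1 if px.index(max(px)) == 0 else 0 for px in row] for row in q_hat]


def largest_component(mask):
    """Keep only the largest 4-connected region of ones (connectivity is assumed)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r0 in range(h):
        for c0 in range(w):
            if mask[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first flood fill from an unvisited water pixel.
            seen[r0][c0] = True
            queue, region = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= rr < h and 0 <= cc < w and mask[rr][cc] == 1 and not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            if len(region) > len(best):
                best = region
    out = [[0] * w for _ in range(h)]
    for r, c in best:
        out[r][c] = 1
    return out
```

Zero-valued blobs enclosed by the retained region would then be the candidate obstacles in the water; extracting them and the nonmaxima suppression stage are not shown here.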

Algorithm 2: The obstacle image map estimation and obstacle detection algorithm.

Require:

Pixel features $Y = \{y_i\}_{i=1:M}$, priors $\varphi_0$, estimated model from the
previous time-step $\Theta_{t-1}$ and $\hat{q}_{t-1}$.

Ensure:

Obstacle image map $B_t$, water edge $e_t$, detected objects
$\{o_i\}_{i=1:N_{\mathrm{obj}}}$, model parameters $\Theta_t$ and $\hat{q}_t$.

Procedure:

1: Initialize the parameters $\Theta_t$ and $\pi_t$ according to Section 3.1.

2: Apply Algorithm 1 to fit the model $\Theta_t$ and $\hat{q}_t$ to the input data $Y$.

3: Calculate the new obstacle image map $B_t$ and, for interpretation, also the water
edge $e_t$ and the obstacles in the water $\{o_i\}_{i=1:N_{\mathrm{obj}}}$.


3.1 Initialization 

Algorithm 1 requires initial values for the parameters $\Theta_t$ and $\pi_t$. At the
first frame, when no other prior knowledge exists, we construct the initial distribution
by vertically splitting the image into three regions {0, 0.2}, {0.2, 0.4} and {0.6, 1},
written in proportions of the image height (see Figure 4). A Gaussian is computed from
each region, thus forming the observed components
$\Theta_{\mathrm{obs}} = \{\mu_{\mathrm{obs}k}, \Sigma_{\mathrm{obs}k}\}_{k=1:3}$. The
prior over all pixels is initialized to equal probabilities for the three components,
while the prior on the uniform component is set to a low constant value (see Section 3).
These parameters are used to initialize the EM in Algorithm 1.
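
The region-based initialization can be sketched as follows. The band limits are those given above; the function names are hypothetical, and the sketch makes no claim about which band initializes which component or which end of the image corresponds to proportion zero. Only the mean is computed here; the covariance of each band would be obtained from the same pixels.

```python
# Bands of the image height used for initialization, as proportions.
BANDS = [(0.0, 0.2), (0.2, 0.4), (0.6, 1.0)]


def band_rows(height, band):
    """Row indices covered by a band given as (lower, upper) proportions."""
    lower, upper = band
    return range(round(lower * height), round(upper * height))


def band_mean(features, rows):
    """Mean feature vector over the given rows; features[r][c] is a feature vector."""
    pixels = [px for r in rows for px in features[r]]
    dim = len(pixels[0])
    return [sum(px[d] for px in pixels) / len(pixels) for d in range(dim)]
```

For an image with 50 rows the three bands cover rows 0 to 9, 10 to 19 and 30 to 49, so rows 20 to 29 do not contribute to the initial estimates.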

The shape of the vertical splits in Figure 4 should ideally follow the position (and
inclination) of the true horizon for optimal initialization of the parameters. An
estimate of the true horizon depends on the camera placement and can ideally be obtained
externally from an IMU sensor, but per-frame IMU measurements are not available in the
dataset that is used in our evaluation (Section 5). Therefore, an assumption is made
that the horizon, as well as the edge of water, is usually located within the region
{0.4, 0.6} of the image height, which is the reason this region is excluded from the
computation of the initial parameter values. Making no further assumptions regarding the
proportion between components in the final segmentation, equal regions (2 and 3) are
used to initialize the parameters of components 2 and 3. The assumption on region
splitting is often violated in our dataset from Section 5 due to boat inclination during
turning maneuvers, due to the boat tilting forward and backward, and since the camera
might not have been mounted in exactly the same spot when assembling the boat after
transportation to the test site during the several months over which the dataset was
taken. Nevertheless, the segmentation algorithm is robust enough to handle non-ideal
initializations as long as there are no extreme deviations, like the boat toppling or
riding on extremely high waves (small coastal USVs are not even designed to physically
endure such extreme weather conditions).

Fig. 4. Illustration of image partitioning for extraction of the $\Theta_{\mathrm{obs}}$
components and combination with the model from the previous time-step $\Theta_{t-1}$
for initialization of the EM.

During the USV's operation, we can exploit the continuity of sequential images in the
video stream by using the parameter values of the converged model from the previous
time-step for initialization of the EM algorithm in the current time-step. To reduce
possible propagation of errors stemming from false segmentations in the previous
time-steps, a zero-order soft reset is applied in the initialization of the EM in each
time-step. In particular, the EM is initialized by merging $\Theta_{\mathrm{obs}}$ with
$\Theta_{t-1}$. The parameters of the $k$-th component,
$\{\mu_{\mathrm{init}k}, \Sigma_{\mathrm{init}k}\}$, are initialized by forming a
weighted two-component mixture model from the $k$-th components in
$\Theta_{\mathrm{obs}}$ and $\Theta_{t-1}$, and approximating it by a single component
by matching the first two moments of the distributions (see, e.g., Kristan et al. [38],
[39], [40]). The weights $\alpha$ and $1 - \alpha$ for $\Theta_{t-1}$ and
$\Theta_{\mathrm{obs}}$, respectively, can be used to balance the contribution of each
component. The priors $\pi_t$ over the pixels are initialized by the smoothed posterior
$\hat{q}_{t-1}$ from the previous time-step. The initialization is sketched in Figure 4.
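
Matching the first two moments of a two-component Gaussian mixture has a closed form, sketched below with plain lists. The function name is hypothetical; the formula is the standard moment-matching result, in which the merged mean is the weighted mean and the merged covariance is the weighted second moment minus the outer product of the merged mean.

```python
def merge_gaussians(mu_prev, cov_prev, mu_obs, cov_obs, alpha):
    """Collapse alpha * N(mu_prev, cov_prev) + (1 - alpha) * N(mu_obs, cov_obs)
    into a single Gaussian with the same mean and covariance as the mixture."""
    dim = len(mu_prev)
    beta = 1.0 - alpha
    mu = [alpha * a + beta * b for a, b in zip(mu_prev, mu_obs)]
    cov = [[alpha * (cov_prev[i][j] + mu_prev[i] * mu_prev[j])
            + beta * (cov_obs[i][j] + mu_obs[i] * mu_obs[j])
            - mu[i] * mu[j]
            for j in range(dim)] for i in range(dim)]
    return mu, cov
```

As a one-dimensional check, merging two unit-variance components with means 0 and 2 at equal weights gives mean 1 and variance 2: the merged variance grows by the spread between the two means.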

4 Implementation details 

In our application, the measurement at each pixel is encoded by a five-dimensional
feature vector $y_i = [i_x, i_y, i_{c_1}, i_{c_2}, i_{c_3}]$, where $(i_x, i_y)$ are the
$i$-th pixel coordinates and $(i_{c_1}, i_{c_2}, i_{c_3})$ are the pixel's color
channels. We have determined in a preliminary study that we achieve sufficiently good
obstacle detection by first performing detection on a reduced-size image of 50 × 50
pixels and then rescaling the results to the original image size. The rescaling was set
to match the lower scale of the objects of interest, as smaller objects do not present a
danger to the USV. This approach drastically speeds up the algorithm, to approximately
10 ms per frame in our experiments. The uniform distribution component in (1) is defined
over the image pixel domain and returns equal probability for each pixel. Assuming that
all color channels are constrained to the interval [0, 1], the value of the uniform
distribution is $\mathcal{U}(y_i) = 1/50^2$ at each pixel for our rescaled image. The EM
optimization requires specification of the convolution kernel $\lambda$. Note that the
only constraint on the convolution kernel is that its central element is set to zero and
all elements sum to one. We use a Gaussian kernel with its central element set to zero
and set the size of the kernel to 2% of the image size, which results in a 3 × 3 pixel
kernel. Recall from Section 3.1 that the parameter $\alpha$ influences the soft reset of
the parameters used to initialize the EM. In our implementation, a slightly larger
weight is given to the parameters estimated at the previous time-step by setting
$\alpha = 0.6$.
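
A kernel satisfying the two constraints above can be built as in the following sketch. The function name is hypothetical and the standard deviation `sigma` is an assumed value, since the width of the Gaussian is not given in the text.

```python
import math


def neighborhood_kernel(size=3, sigma=1.0):
    """Gaussian weights with the central element zeroed, rescaled to sum to one.

    sigma is an assumed value; size=3 matches the 3 x 3 kernel described above.
    """
    center = size // 2
    kernel = [[math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2.0 * sigma ** 2))
               for c in range(size)] for r in range(size)]
    kernel[center][center] = 0.0
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]
```

Zeroing the center before normalizing ensures that the neighborhood mixture of a pixel is formed from its neighbors only, with the direct neighbors weighted more heavily than the diagonal ones.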

4.1 Learning the weak priors 

The spatial components in the feature vector play a dual
role. First, they enforce, to some extent, the spatial smoothness
of the segmentation on their own. Second, they provide a
means of weakly constraining the Gaussian components
such that they reflect the three dominant semantic image
parts. This is achieved by the weak priors $p(\Theta|\varphi_0)$
on the Gaussian means, which factorize as a product over the
components $k = 1, \ldots, 3$. Since the
locations and shapes of the semantic components vary significantly
with the view, we deliberately select weak priors, which
are estimated using the training set from our database (see
Section 5). Given a set of training images, the prior of the k-th
component is estimated by extracting the features, i.e., sets
of $y_i$, corresponding to the k-th component from all images
and fitting a single Gaussian to them. Note that, in general, there
is a chance that the training images might bias the horizontal
location of the estimated Gaussian to the left or right part
of the image. In this case, we could constrain the horizontal
position of the Gaussians to be in the middle of the image;
however, we have observed that the components of the prior
estimated from our dataset are sufficiently centered and we
do not apply any such constraints.
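The prior-learning step described above amounts to pooling the labelled features over all training images and fitting one Gaussian per semantic component. The following is a minimal Python sketch under stated assumptions: the function name `fit_weak_priors`, the integer label encoding (0, 1, 2) and the use of the sample mean and sample covariance are illustrative choices, not details taken from the reference implementation.

```python
import numpy as np


def fit_weak_priors(features, labels, n_components=3):
    """Fit a single Gaussian to the features of each semantic component.

    features: (N, D) array of feature vectors pooled over all training images.
    labels:   (N,) integer array, component index of each feature (assumed 0..2).
    Returns a list of (mean, covariance) pairs, one per component.
    """
    priors = []
    for k in range(n_components):
        yk = features[labels == k]
        mean = yk.mean(axis=0)
        cov = np.cov(yk, rowvar=False)  # sample covariance over the rows
        priors.append((mean, cov))
    return priors
```

Because only a mean and a covariance are estimated per component, a small number of coarsely annotated training images is sufficient for this step.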

Examples of the spatial parts of the priors estimated from
the training set of the dataset presented in Section 5 are
shown in Figure 5. Our algorithm, as well as the learning
routine, was implemented in Matlab; a reference code is
publicly available (see footnote 1).



Fig. 5. Visualization of the spatial part of the Gaussians in the weak 
priors over our three semantic components. From left to right:
bottom-most, middle and top-most component prior.


5 Marine obstacle detection dataset 

Lacking a sufficiently large, publicly available annotated
dataset to test our method, we have constructed our own
dataset, which we call the Marine obstacle detection dataset
(Modd). The Modd consists of 12 video sequences, providing
in total 4454 fully annotated frames with a resolution of
640 x 480 pixels. The dataset is made publicly available,
along with the annotations and Matlab evaluation routines,
from the MODD homepage (see footnote 2).

1. http://www.vicos.si/Research/UnmannedSurfaceVehicles

The video sequences have been recorded from multiple
platforms, most of them from the small 2.2 meter USV (see
Figure 1 and footnote 3). The USV was developed by Harpha Sea, d.o.o.
Koper, and is based on a catamaran hull design and powered
by an electric, LiPo-battery-powered, steerable thrust propeller.
It can reach a maximum speed of 2.5 m/s and has
an extremely small turn radius. Steering and route planning
are handled by an ARM-powered MCU with a redundant power
supply. For navigation, the MCU relies on a microelectromechanical
inertial navigation unit (MEMS IMU), a solid-state
digital compass and differential GPS. The USV has two different
communication channels to the shore (high- and low-bandwidth)
and its mission can be programmed remotely. An
Axis 207W camera was placed on the USV approximately
0.7 m above the water surface, looking in front of the vessel,
with an approximately 55° field of view. The camera has been
set up to automatically adjust to the variations in lighting
conditions. Since the boat was being reassembled between
the runs over several months, the placement of the camera
varies slightly across the dataset.

The video sequences have been acquired in the Gulf of
Trieste, specifically in the port of Koper, Slovenia (Figure 8),
over a period of months at different times of day under
different weather conditions. The USV was manually operated
by a human pilot and effort was made to simulate
realistic navigation, including threats of collision. The pilot 
was instructed to deliberately drive in a way to simulate 
situations in which an obstacle might present a danger to 
the USV. This includes obstacles being very close to the boat 
as well as situations in which the boat was heading straight 
towards an obstacle for a number of frames. 

The first ten videos in the dataset are meant for evaluation
of the obstacle-map estimation algorithms under
normal conditions. These videos still vary quite significantly
from one another and simulate the conditions under which
the USV is expected to operate. We thus term these ten
videos as normal conditions and show some examples
of these videos in the first ten images of Figure 6. The
last two videos are meant for analysis of situations in
which the boat is directly facing the sun. This causes extreme
changes in the automatic shutter and camera settings,
resulting in significant changes of contrast and color of all
three semantic components. Facing the sun also generates a
significant amount of fragmented glitter, although the reflection
sometimes shows up as a larger, fully connected region of the
reflected sun. We thus denote these last two videos as extreme
conditions. Some examples are shown in the last two images of
Figure 6.

Each frame is annotated manually by a polygon denoting
the edge of water, and bounding boxes are placed
on large obstacles (those that straddle the water edge) and
small obstacles (those that are fully surrounded by water).
See Figure 8 for illustration. The annotation was made by
a human annotator and all annotations on all images of
the dataset were later verified by an expert. To allow a
fast overview of the annotations by the potential users of
the dataset, the dataset provides a rendered video with
overlaid annotations for each test sequence; these videos
are included as part of the dataset and are also available
from the dataset homepage.

2. http://www.vicos.si/Downloads/MODD

3. A video of our USV is available online from the MODD homepage.

Fig. 6. Examples of images taken from the videos in the Modd. The first
ten videos are for normal conditions, while the last two depict extreme
conditions. For each video we show two images for better impression of
the video content.

In the following, some general statistics of the dataset
are provided. The boat was driving within 200 meters of
the shore, and most of the depicted obstacles are in this
zone. Out of the 12 image sequences in the dataset, nine contain
either large or small obstacles, one contains only the annotated
sea edge, and two contain glitter annotations, sea edge
annotations, and no objects. The number of objects per
frame is exponentially distributed with an average of 1.1 and
a variance of 1.23. The distribution of the annotated size of small
and large obstacles is shown in Figure 7. For the algorithms
that require training or validation of their parameters, we
have compiled a collection of twenty images in which
we manually annotated the pixels corresponding to the three
semantic components. Figure 10 shows some examples of
images and the corresponding annotations.

5.1 The evaluation protocol 

The evaluation protocol is designed to reflect the two distinct
challenges that the USVs face: the water edge
(shoreline/horizon) detection and obstacle detection. The former
is measured as the root mean square error (RMSE) of the
water edge position (Edg), and the latter is measured via
the efficiency of small object detection, expressed as precision
(Prec), recall (Rec), F-score (F) and the average number of
false positives per frame (aFP).

Fig. 7. Distributions of the floating obstacle sizes labelled as large and
small shown in red and blue, respectively.

Fig. 8. Left: Scene representation used for evaluation. Right: The dashed
outline denotes the region in the coastal waters of Koper, Slovenia (Gulf
of Trieste), where the dataset was captured.

To evaluate the RMSE in the water edge position, ground truth
annotations were used in the following way. A polygon
denoting the water surface was generated from the water edge
annotations. Areas where large obstacles intersect the polygon
were removed. Note that, given the scene representation
shown in Figure 8, one cannot distinguish between large
obstacles (e.g., large ships) and stationary elements of the
shore (e.g., small piers). This way, a refined water edge was
generated. For each pixel column in the full-sized image, the
distance between the water edge as given by the ground truth
and as determined by the algorithm was calculated. These
values are summarized into a single Edg value by averaging
across all frames and videos.
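A minimal Python sketch of the water-edge measure described above is given below. It assumes that the refined ground-truth edge and the estimated edge are both available as one row coordinate per pixel column of the full-sized image, that the RMSE is computed per frame over the columns, and that the per-frame values are then averaged. The function names and this exact order of aggregation are assumptions for illustration; the text only states that the column-wise distances are summarized by averaging across frames and videos.

```python
import numpy as np


def edge_rmse(gt_edge, est_edge):
    """RMSE between two water edges, each given as a row index per pixel column."""
    diff = np.asarray(gt_edge, dtype=float) - np.asarray(est_edge, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def edg_score(frames):
    """Mean and standard deviation of the per-frame RMSE.

    frames: iterable of (gt_edge, est_edge) pairs, one pair per frame.
    """
    values = np.array([edge_rmse(gt, est) for gt, est in frames])
    return float(values.mean()), float(values.std())
```

Reporting the standard deviation alongside the mean mirrors the "average (standard deviation)" format used for Edg in the result tables.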

Fig. 9. Images with a scuba diver emerging from the water just at
the observed water edge: the upper row shows the water edge and
detected obstacles in the water, while the bottom row shows the water
mask. In the left-hand images, the estimated water edge runs above
the diver and the scuba diver is explicitly detected. In the right-hand
images the edge runs below the scuba diver, which prevents explicit
detection, even though the region corresponding to the scuba diver is in
fact detected as a part of the obstacle.

Fig. 10. Examples of training images along with their manual label
masks. The blue, green and red colors correspond to the labels for the
bottom, middle and top semantic component, respectively.

The evaluation of object detection follows the recommendations
from the PASCAL VOC challenges by Everingham
et al. (IT), with a small, application-specific modification:
all small obstacles (provided as ground truth or detected)
that are closer to the annotated water line than 5% of the
image height are discarded prior to evaluation on each
frame. This aims to address situations in which a detection
may oscillate between a fully water-enclosed obstacle and
a "dent" in the shoreline, resulting in false negatives.
Figure 9 shows an example with two images of a scuba diver
emerging from the water. Note that in both images, the
segmentation successfully labeled the scuba diver as an
obstacle. In the left-hand image, however, we obtain an explicit
detection, since the estimated water edge runs above the
scuba diver. In the right-hand image the edge runs below the
scuba diver and we do not get an explicit detection, even though
the algorithm successfully labeled the scuba diver's region
as being an obstacle. Note that the proposed treatment of
near-edge detections/ground-truths is also consistent with
the problem of obstacle avoidance: the USV is concerned
primarily with the avoidance of the obstacles in its immediate
vicinity. In counting false positives (FP), true positives
(TP) and false negatives (FN), we follow the methodology of
PASCAL VOC, with the minimum overlap set to 0.3. FP, TP
and FN were used to calculate precision (Prec), recall (Rec),
F-score (F) and average false positives per frame (aFP).
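The counting of TP, FP and FN and the derived measures can be sketched in Python as follows. This is an illustrative sketch, not the dataset's Matlab evaluation routine: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the overlap is taken as intersection over union, detections are matched greedily in the order given (no confidence scores are assumed), and the measures are computed from counts pooled over frames, which may differ from the averaging used for the reported numbers. The 5% near-edge filtering is omitted for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_frame(gt, det, min_overlap=0.3):
    """Greedy one-to-one matching in a single frame; returns (tp, fp, fn)."""
    unmatched = list(gt)
    tp = 0
    for d in det:
        best = max(unmatched, key=lambda g: iou(g, d), default=None)
        if best is not None and iou(best, d) >= min_overlap:
            unmatched.remove(best)
            tp += 1
    return tp, len(det) - tp, len(unmatched)


def summarize(counts):
    """Precision, recall, F-score and average false positives per frame.

    counts: list of per-frame (tp, fp, fn) tuples (pooled over frames).
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    afp = fp / len(counts) if counts else 0.0
    return prec, rec, f, afp
```

A lower minimum overlap than the usual PASCAL VOC setting tolerates loosely fitting boxes around small floating objects, which is consistent with the application-specific modification described above.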

6 Experiments 

In the following we will denote our obstacle image-map
estimation method (Algorithm 2) as the semantic-segmentation
model (SSM). The experimental analysis was split into three
parts. In the first part we evaluate the influence of the
different color spaces on the SSM's performance. In the
second and third parts we analyze how various elements of
SSM affect its performance and compare it to the alternative
methods. All experiments were performed on a desktop PC
with a 3.06 GHz Intel Xeon E5-1620 CPU in a single thread in
Matlab.

6.1 Influence of the color space 

The aim of the first experiment was to evaluate how the
different colorspaces affect the segmentation performance. We
have therefore performed experiments in which the feature
vector $y_i$ (Section 4) was calculated from the RGB, HSV, Lab and
YCrCb colorspaces. For each of the selected colorspaces, the
weak priors were learned from the training images of the
Modd dataset (Section 5). All experiments were performed
on all twelve testing videos from the Modd dataset. The
results are shown in Table 1.

TABLE 1

Effects of the colorspace on the SSM segmentation performance using
all twelve video sequences from the Modd. The results are given by
reporting the average performance with a standard deviation in
brackets: Edge of water estimation error, precision, recall, F measure
and average false positives (Edg, Prec, Rec, F, aFP). For each
performance measure, the best performing method is marked in bold.

colorspace   Edg[pix]     Prec    Rec     F       aFP
RGB          10.7(5.8)    0.874   0.756   0.806   0.039
HSV          12.7(8.3)    0.821   0.688   0.738   0.041
Lab          9.3(5.1)     0.878   0.768   0.815   0.039
YCrCb        9.5(5.5)     0.885   0.772   0.819   0.039

The results show that the best performance is achieved with
the YCrCb and Lab colorspaces, which is not surprising,
since these colorspaces are known to better cluster
visually-similar colors. The same is true for the HSV space, but that
space suffers from the circular property of the Hue component
(i.e., red color is on the left-most and right-most part of
the Hue spectrum). With respect to the edge of the water
estimation, the lowest error is achieved when using the
Lab colorspace, while only a slightly worse performance is
obtained with the YCrCb colorspace. On all other measures,
the YCrCb colorspace yields the best results, although comparable
to the Lab colorspace. While the results are worse when
using the RGB or the HSV colorspace, we note that these
results do not exhibit drastically poorer performance, which
speaks of a level of robustness of the SSM to the choice of
the colorspace. Nevertheless, given the results in Table 1, we
select the YCrCb colorspace and use it in the subsequent
experiments.

6.2 Comparison to alternative approaches 

Given a fixed colorspace, we are left with evaluating
how much each part of our model contributes to the final
performance. We have therefore also implemented two
variants of our approach, which we denote by UGM and
UGM_col. In contrast to SSM, the UGM and UGM_col do
not use the MRF constraints and are in this respect only
mixtures of three Gaussians with priors on their means and
with a uniform component. A further difference between
UGM and UGM_col is that UGM_col ignores the spatial
information in visual features and relies only on color.

Note that the SSM is conceptually similar to the Grab-cut
algorithm from Rother et al. (42) for binary segmentation,
but with distinct differences. In the Grab-cut, the user
provides a bounding box roughly containing the object, thus
initializing the segmentation mask. Two visual models using
a GMM are constructed from this segmentation mask, one
for the object and one for the background. An MRF is then
constructed over the pixel grid and the graph cut from Boykov
et al. (43) is used to infer an improved segmentation mask.
This procedure is then iterated until convergence. There
are significant differences between the proposed SSM and
the Grab-cut from (42). In contrast to the user-provided
bounding box in (42), the SSM's weak supervision comes
from the initialization of the parameters from the previous
time-step and from the weak priors. The second distinction
is that our approach does not require explicit estimation of the
segmentation mask to refine the mixture model. This allows
for a better propagation of uncertainty during the iterations
of the algorithm, leading to improved segmentation.

To further evaluate the contribution of the particular MRF
optimization of our SSM, we have implemented a variant of
the Grab-cut algorithm, which uses our semantic mixture
model, but applies graph-cuts for optimization over the
MRF. The resulting obstacle-map estimation tightly follows
Algorithm 1 and Algorithm 2, with a slight modification of
Algorithm 1: after each epoch of the EM, we apply
the graph-cut from Bagon (44) to segment the image into
a water/non-water mask. This mask is then used as in
the original Grab-cut to refine the mixture model. We use
exactly the same weakly-constrained mixture model as in
SSM, and the YCrCb colorspace for fair comparison, and
call this approach the Grab-cut model (GCM).

We have also compared our approach to general
segmentation approaches, namely the superpixel-based approach
from Li et al. (33), SPX, and a graph-based segmentation
algorithm from Felzenszwalb and Huttenlocher (21),
FZH.

For fair comparison, all the algorithms were executed
on the 50 x 50 images. We have experimented with the
parameters of GCM and FZH and have set them to optimal
performance for our dataset. Since FZH was designed to run
on larger images, we have also performed the experiments
for FZH on full-sized images; we denote this variant
by FZH_full. We have performed the comparative analysis
separately for the normal and extreme conditions.


6.2.1 Performance under normal conditions 
The results of the experiments on the normal conditions part
of the Modd are summarized in Table 2, while Figure 11
shows an example of typical segmentation masks from the
compared algorithms. The segmentation results in these
images are color coded as follows. The original image is
represented only by the blue channel, manual water
annotations are shown in the green channel, and the
algorithm-generated water segmentation is shown in the red channel.
Therefore, the cyan region shows the area which has been
annotated as water but has not been segmented as such
by the algorithm (bad). The magenta region shows the
area which has not been annotated as water but has been
segmented as such by the algorithm (bad). The yellow region
shows the area which has been annotated as water and
has been segmented as such by the algorithm (good), and the
blue region shows the area which has not been annotated as
water and has not been segmented as such (good). Finally,
the darker band under the annotated edge of the water in
all colors shows the ignore region, where evaluation of small
obstacle detection does not take place.
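The channel assignment described above can be reproduced with a few lines of Python. This is a minimal sketch for illustration: the function name `overlay`, the use of a single-channel intensity image for the blue channel and the boolean mask inputs are assumptions, and the darker ignore band is not modelled.

```python
import numpy as np


def overlay(gray, gt_water, seg_water):
    """Compose the color-coded comparison image.

    gray:      (H, W) float array in [0, 1], the original image intensity.
    gt_water:  (H, W) bool array, manually annotated water.
    seg_water: (H, W) bool array, water as segmented by the algorithm.
    Returns an (H, W, 3) RGB array.
    """
    rgb = np.zeros(gray.shape + (3,), dtype=float)
    rgb[..., 0] = seg_water  # red channel: algorithm-generated water mask
    rgb[..., 1] = gt_water   # green channel: manual water annotation
    rgb[..., 2] = gray       # blue channel: original image
    return rgb
```

With this assignment, pixels marked only in the annotation mix green with blue, and pixels marked only by the algorithm mix red with blue, which gives the cyan and magenta error regions mentioned in the text.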

TABLE 2

Comparison of various methods under normal conditions. The results
are given by reporting average performance with a standard deviation
in brackets: Edge of water estimation error in pixels (Edg), precision
(Prec), recall (Rec), F measure (F), average false positives (aFP) and
time in [ms] (t).

            Edg           Prec    Rec     F       aFP     t
SSM         9.2(4.9)      0.885   0.772   0.819   0.039   10(0)
GCM         10.9(5.6)     0.718   0.686   0.695   0.121   17(3)
UGM         10.5(6.1)     0.742   0.706   0.717   0.109   11(2)
UGM_col     16.4(9.0)     0.614   0.504   0.549   0.122   11(3)
FZH         90.0(65.7)    0.727   0.523   0.551   0.053   16(1)
FZH_full    34.2(41.4)    0.410   0.747   0.488   0.697   199(3)
SPX         66.4(34.7)    0.007   0.001   0.001   0.090   54(1)

Fig. 11. Qualitative comparison of the different obstacle-map estimation
approaches. The upper left-most image is the original input image,
followed by the results for SSM, UGM, UGM_col, GCM, FZH, FZH_full
and SPX. This image is best viewed in color. Please see text for the
description of the color codes.

Recall that in contrast to the SSM, the UGM and UGM_col
do not impose a Markov random field constraint. In Figure 11
this clearly leads to a poorer segmentation, resulting
in false positive detections of obstacles as well as significant
over-segmentations. Quantitatively, the poorer performance
is reflected in Table 2 as a lower F-measure, a higher average
number of false positives and a larger edge of the water
estimation error. Compared to SSM, we observe a significant
drop in detection quality of the UGM, especially in precision.
This speaks of the importance of the local labeling constraints
imposed by the MRF in the SSM. The performance further
drops with UGM_col, which implies that the spatial components
in the feature vectors bear important information for proper
segmentation as well. The GCM, on the other hand, does
impose an MRF; however, the segmentation is still poorer
than with the SSM. We believe that the main reason for
this is that the GCM applies graph-cuts to perform hard
segmentation during EM updates. In contrast, the
SSM optimizes the cost function within a single EM framework,
thus avoiding the need for hard segmentations during
the EM steps, which leads to a better final result. By far
the worst segmentation results are obtained by the FZH,
FZH_full and SPX segmentation methods. Note that while
these segmentation methods do assume some local consistency
of segmentation, they still perform worse than the
SSM. The improved performance of SSM can be attributed
exclusively to our formulation of the segmentation model
within the graphical model from Figure [5] 


Figure 12 shows further examples of SSM segmentation
maps (the first fourteen images), the spatial part of the
Gaussian mixture and the detected objects in water. The
appearance and texture of the water vary significantly
between the various scenes, and the same is true for the
other two semantic components. The images also vary in
the scene composition, in that the vertical position as well
as the attitude of the water edge (see the second row in
Figure 12) vary significantly. Nevertheless, note that the model
is able to adapt well to these compositions: it successfully
decomposes the scene into water regions and in-water obstacles,
and delineates the water edge fairly well.

Our algorithm (segmentation + detection) ran at
a rate higher than 70 frames per second. Most of the
processing time was spent on fitting our semantic model and
obstacle-map estimation (10 ms), while 4 ms was spent on
the obstacle detection. For fair comparison of segmentation
algorithms, we report in Table 2 only the times required
for the obstacle-map estimation. Note, however, that the
obstacle detection part did require more processing time for
the methods that delivered poor segmentation masks with
more false positives. On average, our EM algorithm in SSM
converged in approximately three iterations. Note that the
graph cut routine in GCM, part of SPX and the FZH were
implemented in C and interfaced to Matlab, while all the
other variants were entirely implemented in Matlab. Therefore,
the computational times for segmentation are
not directly comparable among the methods, but still offer
a level of insight. In terms of processing time, the SSM's
segmentation was the fastest, running at 100 frames per
second. The UGM_col and UGM performed approximately
as fast as SSM, followed by GCM, FZH, SPX and FZH_full.
We conclude that the SSM came out on top as the fastest
method that also achieved the best detection performance
as well as accuracy.

6.2.2 Performance under extreme conditions 
We were interested in measuring two properties of the
algorithms under conditions in which the boat is facing the sun. In
particular, we were interested in measuring how the sun affects
the edge-of-water estimation and how the glitter affects the
detection. We have therefore performed two variants of the
experiments on the videos in the Modd denoted as extreme
conditions (videos 11 and 12 in Figure 6). In the first variant,
we ignored any detections in the regions that were denoted
as glitter regions in the ground truth. In the second variant,
all detections were accounted for. Note that the videos
denoted as extreme conditions do not contain any objects;
therefore there were no true positives and any detected
object was a false positive. Because of this, we present in
the results (Table 3) only the edge of water estimation error
and the average number of false positives (by definition,
both accuracy and precision would be zero in such a case).

Fig. 12. Qualitative examples of water segmentation and obstacle detection. We show for each image the detected edge of the sea in green and
the detected obstacle by a yellow rectangle. Below each image we also show the spatial part of the three semantic components as three Gaussian
ellipses and the portion of the image segmented as water in blue. This figure is best viewed in color.

In terms of edge of water estimation, the UGM_col slightly
outperforms the SSM. The UGM_col ignores the spatial
information and generally oversegments the regions close
to the shoreline (as seen in Figure 11), which in this case
actually reduces the error compared to SSM. The reason is
that the SSM attributes the upper part of the sun reflection
at the shoreline in video 11 (Figure 6) to an obstacle
instead of the water. When ignoring the glitter region, the
SSM outperforms the competing methods by not detecting
any false positives (zero aFP_ignore), while the competing
methods exhibit larger numbers of false positives. When
considering also the glitter region, the number of false
positives only slightly increases for the SSM, while this
increase is considerable for the other methods. Note that
in this case the SSM again significantly outperforms the
other methods, except for SPX. The reason is that the SPX
actually fails by grossly oversegmenting the water region,
thus assigning almost all glitter to that region. However,
looking at the results of the edge estimation, we can also see
that this oversegmentation also consumes a part
of the shoreline, thus leading to poor overall segmentation.
Among the remaining methods, the SSM again achieves
the lowest average false positive rate. Given these results
we conclude that the SSM is much more robust to extreme
conditions than the competing methods, while still offering
good segmentation results. Some examples of segmentation
with SSM are shown in the last four images of Figure 12.
Even in these harsh conditions the model is able to interpret
the scene well enough, with few false obstacle detections.
For more illustrative examples of our method and
segmentations, please consult the additional online material at
http://box.vicos.si/matejk/smc/index.htm


TABLE 3

Comparison of various methods under extreme conditions. We show
the results for edge of water estimation error Edg and average false
positives when ignoring glitter and when counting the glitter as false
positives, aFP_ignore and aFP_account, respectively.

            Edg[pix]      aFP_ignore   aFP_account
SSM         11.3(10.0)    0.000        0.134
GCM         16.1(15.3)    0.010        0.919
UGM         12.4(11.3)    0.007        0.932
UGM_col     8.1(5.1)      0.019        0.308
FZH         65.3(48.7)    0.003        0.233
FZH_full    46.4(51.6)    0.159        2.056
SPX         49.2(48.6)    0.015        0.019


6.2.3 Failure cases 

An example of conditions in which the segmentation is
expected to fail is shown in the bottom-right image
of Figure 12. In this image, the boat is directly facing a low-lying
sun, which results in a large saturated glitter on
the water surface. Since the glitter occupies a large region
and is significantly different from the water, it is detected
as an obstacle. Such cases could be handled by image
postprocessing, but at a risk of missing true detections.
Nevertheless, additional sensors like a compass, an IMU and
a sun-position model can be used to identify a detected region
as potential glitter. To offer further insight into the
limitations of the proposed segmentation, we show additional
failure cases in Figure 13. Figure 13a shows a failure due to a
strong reflection of the landmass in the sea, while Figure 13b
shows an example of failure due to a blurred transition from
the sea to the sky. Note that in both cases, the edge of the
sea is conservatively estimated, meaning that true obstacles
were not mislabelled as water, but rather portions of water
were labelled as obstacle. Figure 13c shows an example in
which several obstacles are close-by and are not detected
as separate obstacles, but rather as part of the edge of
water. An example of potentially dangerous mislabelling
is shown in Figure 13d, where a part of the boat on the
left is deemed visually-similar to water and is labelled as
such. Note, however, that this mislabelling is corrected in
the subsequent images in that video.

Fig. 13. Qualitative examples of poor segmentation.

6.3 Effects of the target size 

Note that not all obstacles pose an equal threat to the
vessel. In fact, smaller objects are likely not people and
are also likely to pose little threat, since they can be run
over without damaging the vessel. To inspect our results
in such a context, we have compiled the SSM results over
all videos with respect to the minimum object size. Any
object, whether in the ground truth or in the detections, was
ignored if its size was smaller than a predefined value. We
also ignored any detected object that overlapped with a
removed ground truth object by 0.3. This last condition
addresses the fact that some objects in the ground truth
are slightly smaller than their detected size, which would
generate an incorrect false positive if the ground truth object
was removed. Figure 14 visualizes the applied thresholds,
while the results are given in Table 4.
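The size-based filtering described above can be sketched in Python as follows. Several details are assumptions made for illustration only: boxes are `(x1, y1, x2, y2)` tuples, an object counts as smaller than the a x a threshold when its box area is below a squared, and the overlap of 0.3 is interpreted as an intersection over union of at least 0.3. The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def is_small(box, a):
    """True if the box area is below a * a (assumed definition of object size)."""
    return (box[2] - box[0]) * (box[3] - box[1]) < a * a


def apply_min_size(gt, det, a, overlap=0.3):
    """Drop small objects and the detections that belong to removed ground truth."""
    removed_gt = [g for g in gt if is_small(g, a)]
    kept_gt = [g for g in gt if not is_small(g, a)]
    kept_det = [
        d for d in det
        if not is_small(d, a)
        and all(box_iou(d, g) < overlap for g in removed_gt)
    ]
    return kept_gt, kept_det
```

The second condition in `apply_min_size` implements the safeguard from the text: a detection slightly larger than a removed ground truth object is discarded rather than counted as a false positive.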

The results show that the detection performance remains high over a
range of small thresholds, which speaks of a level of robustness
of our approach. By increasing the thresholds above 10 x 10,
the precision as well as the recall increase, and the probability
of detecting a false positive in a given frame is drastically
reduced. This means that, as the objects approach the USV
and get bigger, they are increasingly reliably detected. This
is also true for the sufficiently big objects that are far away
from the USV. The following rule-of-thumb calculation for
the big or approaching objects can be performed. Let us
assume that a successful detection means any detection of
a true obstacle if we detect it at least once in $N_{buf} = 3$
consecutive frames. The probability of a successful detection
is therefore

$p_{success} = 1 - (1 - Rec)^{N_{buf}}$. (14)

If we do not apply any thresholding, we can detect any
object, regardless of its size, with probability 0.988. The
probability of a false positive occurring in any image is
0.055. By applying a small 3 x 3 threshold, the detection
remains unchanged, but the probability of a false positive
occurring in a particular frame goes down to 0.05. If we
choose to focus only on the objects that are at least thirty
by thirty pixels large, then the probability of detection
goes up to 0.992, and the probability of detecting a false
positive in any frame goes down to 0.01. It should be noted
that the model in (14) assumes independence of detections
over the sequence of images. While such an assumption may
indeed be restrictive for temporal sequences, we still believe
that the model gives a good rule-of-thumb on the expected
real-life obstacle detection performance of the segmentation
algorithm.
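The rule-of-thumb in (14) is easy to check numerically. The short Python sketch below evaluates it for the recall values reported for the unthresholded and the 30 x 30 settings; the function name `p_success` is chosen here for illustration, and the independence of detections across frames is the same simplifying assumption as in the text.

```python
def p_success(recall, n_buf=3):
    """Probability of detecting an obstacle at least once in n_buf frames,
    assuming independent detections with per-frame recall `recall`."""
    return 1.0 - (1.0 - recall) ** n_buf


# Recall without thresholding and with the 30 x 30 threshold (Table 4).
print(round(p_success(0.772), 3))
print(round(p_success(0.801), 3))
```

Increasing `n_buf` raises the detection probability further, at the cost of a longer reaction delay before an obstacle is confirmed.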

TABLE 4

The results of the SSM_{a x a} and related approaches on all video
sequences with respect to the minimum object size a x a. For
reference, the results for top-performing baselines are provided for
5 x 5, 15 x 15 and 30 x 30 pixels.

               Prec    Rec     F       aFP
SSM 0x0        0.885   0.772   0.819   0.055
SSM 3x3        0.898   0.772   0.825   0.049
SSM 5x5        0.898   0.772   0.825   0.049
SSM 10x10      0.898   0.773   0.826   0.049
SSM 15x15      0.896   0.792   0.837   0.049
SSM 30x30      0.924   0.801   0.846   0.010
GCM 5x5        0.733   0.686   0.703   0.246
GCM 15x15      0.731   0.701   0.701   0.246
GCM 30x30      0.812   0.759   0.772   0.068
FZH 5x5        0.731   0.523   0.553   0.082
FZH 15x15      0.731   0.536   0.565   0.082
FZH 30x30      0.779   0.577   0.614   0.038
SPX 5x5        0.007   0.001   0.001   0.078
SPX 15x15      0.007   0.001   0.001   0.078
SPX 30x30      0.003   0.001   0.001   0.074


7 Discussion and conclusion 

A graphical model for semantic segmentation of marine
scenes was presented and applied to USV obstacle-map
estimation. The model exploits the fact that scenes a USV
encounters may be decomposed into three dominant visually-
and semantically-distinctive components, one of which is
the water. The appearance is modelled by a mixture of
Gaussians and accounts for the outliers by a uniform component.
The geometric structure is enforced by placing weak priors
over the component means. An MRF model is applied on the
prior and posterior pixel-label distributions to account for the
interactions across neighboring pixels. An EM algorithm is
derived for fitting the model to an image, which affords fast
convergence and efficient implementation. The proposed
model directly applies straightforward features, i.e., color
channels and pixel positions, and avoids potentially slow
extraction of more complex features. Nevertheless, the model
is general enough to be directly applied without modifications
to any other features. A straightforward approach
for estimation of the weak prior model was proposed, which
allows learning from a small number of training images
and does not require accurate annotations. The results show
excellent performance compared to related segmentation
approaches, with improvements in terms of
segmentation accuracy as well as speed.

Fig. 14. Visualization of minimum thresholds used in Table 4.

To evaluate and analyze our algorithm,
we have compiled and annotated a new real-life coastline
segmentation dataset captured from a camera onboard a marine
vehicle. This is the largest dataset of its kind to
date and is as such another contribution to the field of
robotic vision. We have studied the effects of the colorspace
selection on the algorithm's performance and conclude that
the algorithm is fairly robust to this choice, but obtains the best
results in the YCrCb and Lab colorspaces. The experimental
results also show that the proposed algorithm significantly
outperforms the related solutions. The algorithm provides
high detection rates at low false-positive rates, and does so
with minimal processing time. The speed comes from
the fact that the algorithm can be implemented through
convolutions and from the fact that it performs robustly on
small images. The results have also shown that the proposed
method outperforms the related methods by a large margin
in terms of robustness in extreme conditions, when the
vehicle is facing the sun. To make the research in this paper
reproducible and to help other researchers
compare their work to ours, the Modd dataset is made
publicly available, along with all the Matlab evaluation
routines, a reference Matlab implementation of the presented
approach and the routines for learning the weak priors.
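
To give a sense of why a convolution-based formulation is cheap, the
sketch below smooths per-pixel label probabilities with a small
neighbourhood kernel and renormalizes them. It is a generic illustration
in plain Python, not the update derived in this paper: the 3 x 3
averaging kernel with the centre excluded, the border handling
(neighbours outside the image are ignored), and the function names are
all assumptions.

```python
from typing import List

Grid = List[List[float]]  # one probability per pixel, indexed [row][col]


def smooth_neighbours(prob: Grid) -> Grid:
    """Average each pixel's 8-connected neighbours (centre excluded)."""
    rows, cols = len(prob), len(prob[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        total += prob[rr][cc]
                        count += 1
            # A 1 x 1 image has no neighbours; keep the value unchanged.
            out[r][c] = total / count if count else prob[r][c]
    return out


def smooth_and_normalize(label_probs: List[Grid]) -> List[Grid]:
    """Smooth every label map, then rescale so labels sum to 1 per pixel."""
    smoothed = [smooth_neighbours(p) for p in label_probs]
    rows, cols = len(smoothed[0]), len(smoothed[0][0])
    for r in range(rows):
        for c in range(cols):
            z = sum(s[r][c] for s in smoothed)
            if z > 0.0:
                for s in smoothed:
                    s[r][c] /= z
    return smoothed


# Toy 3 x 3 example: water everywhere except one isolated centre pixel.
water = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
other = [[1.0 - v for v in row] for row in water]
water_s, other_s = smooth_and_normalize([water, other])
```

In the toy example the isolated centre pixel is pulled back towards the
water label because all of its neighbours agree. Each output pixel
depends only on a fixed local window, which is why operations of this
kind run quickly on small images and parallelize well.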

Note that fast performance is of crucial importance
for real-life implementations on USVs, as it allows use
in onboard embedded controllers and with low-cost,
low-resolution embedded cameras. Our future work will focus on two
extensions of our algorithm. We will explore possibilities of
porting our algorithm to such an embedded sensor. Since
many modern embedded devices contain GPUs, we will
also explore parallelization of our algorithm by exploiting
the fact that it is based on convolution operations, which can
be efficiently parallelized. Our model is fully probabilistic
and as such affords a principled way for information fusion,
e.g., [45], to improve performance. We will explore
combinations with additional external sensors such as inertial
sensors, cameras of other modalities and stereo systems. In
particular, an IMU can be used to modify the priors and soft
reset parameters on the fly, as well as to estimate the position
of the horizon in the images. The segmentation model can
then be constrained by hard-assigning pixels above the
horizon to the non-water class. Temporal constraints on
segmentation can be further imposed by image-based
ego-motion estimation using techniques from
structure-from-motion.
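
The horizon constraint mentioned above could be sketched as follows.
This is a hypothetical illustration of the proposed future extension,
not an implemented part of the method: it assumes the horizon is given
as one row index per image column (for example, derived from IMU roll
and pitch), that row 0 is the top of the image, and that the water
probability map is stored as a list of rows.

```python
from typing import List


def mask_water_above_horizon(water_prob: List[List[float]],
                             horizon_row: List[int]) -> List[List[float]]:
    """Return a copy with water probability set to 0 above the horizon."""
    if water_prob and len(horizon_row) != len(water_prob[0]):
        raise ValueError("need one horizon row per image column")
    out = [row[:] for row in water_prob]
    for r, row in enumerate(out):
        for c in range(len(row)):
            if r < horizon_row[c]:  # rows are counted from the top
                row[c] = 0.0
    return out


# Toy 3 x 2 map with a tilted horizon: row 1 in column 0, row 2 in column 1.
water = [[0.9, 0.9], [0.8, 0.7], [0.6, 0.9]]
constrained = mask_water_above_horizon(water, [1, 2])
print(constrained)  # [[0.0, 0.0], [0.8, 0.0], [0.6, 0.9]]
```

A practical version would also redistribute the removed probability
mass among the remaining labels so that each pixel still sums to one.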